Transient Fault Tolerance via Dynamic Process-Level Redundancy
Authors
Abstract
Transient faults are emerging as a critical concern in the reliability of microprocessors. While hardware reliability techniques are often employed for transient fault tolerance, software techniques represent a more cost-effective and flexible alternative. This paper proposes a software approach to transient fault tolerance that uses a run-time system to automatically apply process-level redundancy (PLR). PLR creates a set of redundant processes per application process and compares the processes during run time to guarantee correct execution. Redundancy at the process level allows the operating system to freely schedule the processes across all available hardware resources (i.e., extra threads or cores). PLR is a software-centric approach to transient fault tolerance in which the focus shifts from ensuring correct hardware execution to ensuring correct software execution. The software-centric approach is able to ignore many benign faults that do not propagate to affect the program output. In addition, the dynamic deployment creates a very flexible fault-tolerant system that transparently applies PLR without prior modifications to the application, shared libraries, or operating system. Experiments using a real PLR prototype on an SMP machine demonstrate that PLR can effectively provide transient fault tolerance with a slowdown of only 1.26x.
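The core idea of running redundant processes and comparing them can be illustrated with a minimal Python sketch. This is not the paper's prototype: it is a simplified stand-in that compares only the final standard output of each replica and picks the majority value, and it does not reproduce the transparent, dynamic deployment described in the abstract. The names `run_redundant` and `majority_vote`, and the default of three replicas, are assumptions made for illustration.

```python
import subprocess
import sys
from collections import Counter


def majority_vote(outputs):
    """Return (majority output, indices of replicas that disagreed with it)."""
    winner, votes = Counter(outputs).most_common(1)[0]
    # A strict majority is required; otherwise the fault cannot be masked.
    if votes <= len(outputs) // 2:
        raise RuntimeError("no majority among replicas")
    faulty = [i for i, out in enumerate(outputs) if out != winner]
    return winner, faulty


def run_redundant(cmd, replicas=3):
    """Run `cmd` as several independent OS processes and vote on their stdout.

    Hypothetical helper: it compares whole-program output only.
    """
    # Start every replica before waiting on any of them, so the operating
    # system is free to schedule them across spare cores or hardware threads.
    procs = [subprocess.Popen(cmd, stdout=subprocess.PIPE) for _ in range(replicas)]
    outputs = [p.communicate()[0] for p in procs]
    return majority_vote(outputs)


if __name__ == "__main__":
    # Example: three replicas of a tiny Python program.
    output, faulty = run_redundant([sys.executable, "-c", "print(6 * 7)"])
    print(output, faulty)
```

With three replicas, a single replica whose output is corrupted is outvoted by the other two. A fault that never changes the output is simply not noticed, which mirrors the abstract's point that benign faults can be ignored.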
Similar resources
Configurable Transient Fault Detection via Dynamic Binary Translation
Smaller feature sizes, lower voltage levels, and reduced noise margins have helped improve the performance and lower the power consumption of modern microprocessors. These same advances have made processors more susceptible to transient faults that can corrupt data and make systems unavailable. Designers often compensate for transient faults by adding hardware redundancy and making circuit and p...
Design and Analysis of Transient Fault Tolerance for Multi Core Architecture
This paper describes a software approach to fault tolerance for shared-memory multi-core systems using PLR. PLR takes a software-centric approach to transient fault tolerance that ensures correct software execution. The scheme operates at the user-space level and does not require changes to the original application. PLR creates a set of redundant processes per application process. In this scheme ...
Exploiting Instruction Redundancy for Transient Fault Tolerance
This paper presents an approach for integrating fault-tolerance techniques into microprocessors by utilizing instruction redundancy as well as time redundancy. Smaller and smaller transistors, higher and higher clock frequency, and lower and lower power supply voltage reduce reliability of microprocessors. In addition, microprocessors are used in systems which require high dependabi...
Energy-Aware Synthesis of Fault-Tolerant Schedules for Real-Time Distributed Embedded Systems
In this paper we present an approach to the scheduling and voltage scaling of low-power fault-tolerant hard real-time applications mapped on distributed heterogeneous embedded systems. Processes and messages are statically scheduled, and we use process re-execution for recovering from multiple transient faults. Addressing simultaneously energy and reliability is especially challenging because l...
Fault-Tolerant Dynamic Systems
Modular redundancy (system replication) and other traditional techniques for fault tolerance in dynamic systems are expensive, and rely heavily — particularly in the case of systems operating over extended time horizons — on the assumption that the error-correcting mechanism (e.g., voting) is fault-free. Herein, we construct redundant dynamic systems in a way that achieves tolerance to transien...
Publication date: 2006